Skip to content

[improve][broker] optimize namespaceBundle validation to fix single-thread 100% CPU during unloading entire namespaces.#25626

Open
zhaizhibo wants to merge 6 commits intoapache:masterfrom
zhaizhibo:avoid_repeated_NamespaceBundles_construction
Open

[improve][broker] optimize namespaceBundle validation to fix single-thread 100% CPU during unloading entire namespaces.#25626
zhaizhibo wants to merge 6 commits intoapache:masterfrom
zhaizhibo:avoid_repeated_NamespaceBundles_construction

Conversation

@zhaizhibo
Copy link
Copy Markdown
Contributor

@zhaizhibo zhaizhibo commented Apr 30, 2026

Motivation

When validating namespace bundle ranges via validateNamespaceBundleRange, the method calls NamespaceBundleFactory.getBundles(NamespaceName, BundlesData) which constructs a new NamespaceBundles object every time. This object contains all bundles for the entire namespace, and its construction involves expensive string formatting operations (e.g., toString() for each bundle boundary).

For a namespace with N bundles, operations like unload or clearBacklog that iterate over all bundles will call validateNamespaceBundleRange for each one, resulting in O(N²) total construction work. For example, a namespace with 4000 bundles would require ~16,000,000 string operations during a single unload, as each of the 4000 bundle validations redundantly reconstructs all 4000 bundle objects.

The NamespaceBundleFactory already maintains a cache (bundlesCache) via getBundlesAsync(NamespaceName) that computes NamespaceBundles once and reuses it. Using this cache eliminates the repeated O(N) construction per validation, reducing the total work from O(N²) to O(N).

Modifications

  • Add validateNamespaceBundleRangeAsync(NamespaceName, String) that uses getBundlesAsync() (cached) instead of getBundles(NamespaceName, BundlesData) (re-constructed each call).
  • Change isBundleOwnedByAnyBroker to use validateNamespaceBundleRangeAsync, removing the BundlesData parameter since bundles are now fetched from cache rather than passed by the caller.
  • Replace the original usage of validateNamespaceBundleRange with the corresponding asynchronous call validateNamespaceBundleRangeAsync.

Verifying this change

  • Make sure that the change passes the CI checks.

  • Dependencies (add or upgrade a dependency)

  • The public API

  • The schema

  • The default values of configurations

  • The threading model

  • The binary protocol

  • The REST endpoints

  • The admin CLI options

  • The metrics

  • Anything that affects deployment

@zhaizhibo zhaizhibo changed the title [improve][broker] Avoid repeated NamespaceBundles construction in bundleRange validation. [improve][broker] optimize namespaceBundle validation to fix single-thread 100% CPU during unloading entire namespaces. Apr 30, 2026
@zhaizhibo
Copy link
Copy Markdown
Contributor Author

@AnonHxy @BewareMyPower @dao-jun @lhotari
Could you please help review this PR?

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Apr 30, 2026

Could you please help review this PR?

Please fix any license or checkstyle issues first so that we can make progress with CI. You can check locally with this command:

./gradlew rat spotlessCheck checkstyleMain checkstyleTest

@zhaizhibo
Copy link
Copy Markdown
Contributor Author

Could you please help review this PR?

Please fix any license or checkstyle issues first so that we can make progress with CI. You can check locally with this command:

./gradlew rat spotlessCheck checkstyleMain checkstyleTest

fixed.😊

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments

Comment thread pulsar-broker/src/main/java/org/apache/pulsar/broker/web/PulsarWebResource.java Outdated
@lhotari
Copy link
Copy Markdown
Member

lhotari commented May 1, 2026

Thanks for the contribution @zhaizhibo

zhaizhibo added 2 commits May 2, 2026 02:32
@lhotari
Copy link
Copy Markdown
Member

lhotari commented May 4, 2026

Local Claude Code review

These comments come from a local Claude Code review run by @lhotari, not from a maintainer review. Treat them as suggestions to investigate, not blocking decisions.

Summary

The optimization is correct and the algorithmic improvement (O(N²) → O(N) for full-namespace operations) is real. The cache invalidation story upstream is sound — NamespaceBundleFactory.bundlesCache is invalidated on bundle changes (NamespaceBundleFactory.java:190, :199, plus the handleMetadataStoreNotification listener at :180), so the cached validation is consistent with the prior policies.bundles-fed path in steady state.

A few items worth a closer look before merge:

1. Dead getNamespacePoliciesAsync fetches — NamespacesBase.java

In internalGetTopicHashPositionsAsync (around line 1601):

.thenCompose(__ -> getNamespacePoliciesAsync(namespaceName))
.thenCompose(policies -> {
    return validateNamespaceBundleOwnershipAsync(namespaceName, bundleRange, false, true)
            .thenCompose(...)
});

And in internalUnsubscribeNamespaceBundleAsync (around line 2077):

.thenCompose(__ -> getNamespacePoliciesAsync(namespaceName))
.thenCompose(policies ->
        validateNamespaceBundleOwnershipAsync(namespaceName, bundleRange, authoritative, false))

In both places the policies lambda parameter is no longer referenced anywhere in the body, but the getNamespacePoliciesAsync call still incurs a metadata round-trip per request. This partially defeats the PR's optimization goal for these two endpoints. Consider removing the step (or replacing with .thenCompose(__ -> ...)).

2. Indentation glitch in the new isBundleOwnedByAnyBroker lambda — PulsarWebResource.java:645-651

                    if (ExtensibleLoadManagerImpl.isLoadManagerExtensionEnabled(pulsar)) {
                       return nsService.checkOwnershipPresentAsync(nsBundle);   // 23 spaces (off by 1)
                    }
                    LookupOptions options = LookupOptions.builder()
                        .authoritative(false)                                    // 24 spaces; rest of file
                        .requestHttps(isRequestHttps())                          // uses 8-space continuation

The return line is off by one space, and the chained-builder lines use a 4-space continuation indent where the rest of the file uses 8. Worth re-running ./gradlew spotlessCheck checkstyleMain (you already had to fix Spotless once in 374ac3b1dc1).

3. Subtle 404 → 412 shift on missing namespace

In the call sites that removed getNamespacePoliciesAsync entirely (internalUnloadNamespaceBundleAsync, internalSplitNamespaceBundleAsync, internalClearNamespaceBundleBacklogAsync, internalClearNamespaceBundleBacklogForSubscriptionAsync), the prior getNamespacePoliciesAsync call also implicitly enforced "namespace exists in global config" (returns 404 on missing).

The new path goes through NamespaceBundleFactory.getBundlesAsyncloadBundlescopyToLocalPolicies, which on a missing namespace falls back to getBundles(namespace, Optional.empty()) — i.e. a default bundle layout. Then validateBundle(nsBundle) either succeeds (if the user-supplied range happens to match the default) or fails with IllegalArgumentException412 PRECONDITION_FAILED.

Net effect: clients calling these endpoints on a non-existent namespace will see 412 (or worse, accidental success on the default bundle range) instead of the expected 404. Worth either restoring an explicit existence check up front, or documenting the new error semantics. Also a good candidate for a test (see #5).

4. protected API break is risky to backport

The PR removes/renames protected hooks on PulsarWebResource:

  • validateNamespaceBundleRange(NamespaceName, BundlesData, String) removed
  • validateNamespaceBundleOwnership(NamespaceName, BundlesData, String, boolean, boolean) removed
  • validateNamespaceBundleOwnershipAsync(...) and isBundleOwnedByAnyBroker(...) lose the BundlesData parameter

These are protected, but custom forks subclassing PulsarWebResource for auth/admin REST extensions are common in the wild. Dropping these hooks in 4.0.11 / 4.1.4 / 4.2.2 patch releases will break downstream builds silently after a minor upgrade. For the patch-release backports specifically, consider keeping deprecated bridge methods that delegate to the new ones — or restrict this change to master/next-minor.

5. Zero new tests

The PR touches eight admin endpoints plus the protected helpers and is targeted for three patch branches, but adds no tests. At minimum:

  • An integration test asserting bundle validation still works correctly after a split via the cached path.
  • A test pinning down the error returned for a non-existent namespace (locks in whichever status code is intended — see Fixed ZookeeperCacheTest failures on Travis build #3).

6. Document the cache-correctness invariant — PulsarWebResource.java:598

A one-line comment near the new getBundlesAsync call referencing the cache invalidation invariant (NamespaceBundleFactory.bundlesCache is invalidated on local-policy notifications) would future-proof the code: anyone who later breaks the cache invalidation would otherwise silently start validating against stale boundaries.


Already addressed in newer commits — kudos:

  • The dead catch (RestException e) block (intermediate diff state) is gone in 294cfb32749.
  • The inconsistent thenComposeAsync in ResourceQuotasBase is now plain thenCompose.
  • Unused BundlesData import already cleaned up in 374ac3b1dc1.

@lhotari
Copy link
Copy Markdown
Member

lhotari commented May 4, 2026

@zhaizhibo Please check the above comment from a local Claude Code review. I think that breaking the protected method API is ok since the PulsarWebResource class isn't meant to be extended by external code.

@zhaizhibo
Copy link
Copy Markdown
Contributor Author

@lhotari
Re: CR "Subtle 404 → 412 shift on missing namespace"
The concern doesn't apply. All affected methods call validateGlobalNamespaceOwnershipAsync() first, which returns 404 for non-existent namespaces before reaching bundle validation. Added testBundleValidationWithNonExistentNamespace to verify this.
Re: CR "Missing tests"
Added two tests: testBundleValidationWithNonExistentNamespace (verifies 404 for non-existent namespaces) and testBundleValidationAfterSplit (verifies bundle validation after split).

@zhaizhibo zhaizhibo requested a review from lhotari May 6, 2026 02:31
@zhaizhibo zhaizhibo force-pushed the avoid_repeated_NamespaceBundles_construction branch from 9e11e84 to 1ea18a3 Compare May 6, 2026 02:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants